Audio Support for Scope #480

Draft

BuffMcBigHuge wants to merge 10 commits into main from marco/feat/audio

Conversation

@BuffMcBigHuge (Collaborator) commented Feb 17, 2026

Audio Support for Scope

Summary

Adds end-to-end audio support to Scope's WebRTC streaming pipeline. Pipelines can now return audio alongside video in their output dict; the server buffers, resamples, and streams audio over WebRTC and NDI. A shared media clock keeps audio and video synchronized.

What's New

Backend

  • Pipeline interface: Pipelines may return {"video": ..., "audio": ..., "audio_sample_rate": ...}. Audio keys are optional; pipelines that don't produce audio are unchanged.
  • PipelineProcessor: New audio_output_queue for audio chunks from the pipeline output dict.
  • FrameProcessor: Audio drain thread reads from the last processor's audio queue, resamples to 48 kHz (WebRTC standard), mixes to mono, and buffers 20 ms chunks. get_audio() drains the buffer for the audio track.
  • MediaClock: Shared clock for A/V sync. Both video and audio tracks derive PTS from get_media_time() so RTCP Sender Reports map correctly to NTP.
  • AudioProcessingTrack: aiortc MediaStreamTrack that produces 20 ms frames at 48 kHz. Returns silence when no audio is available to keep the track alive.
  • NDI: NDIOutputSink.send_audio() sends float32 audio via NDIlib_send_send_audio_v2.
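To make the new pipeline contract concrete, here is a minimal sketch of a pipeline that returns the optional audio keys. The `TonePipeline` class and its field names are illustrative, not the actual Scope API; numpy stands in for torch tensors.

```python
import numpy as np

class TonePipeline:
    """Hypothetical pipeline: emits a blank video frame plus a 440 Hz sine chunk per call."""

    def __init__(self, sample_rate=24_000, chunk_ms=40):
        self.sample_rate = sample_rate
        self.chunk_samples = sample_rate * chunk_ms // 1000
        self._t = 0  # running sample counter so the tone is phase-continuous

    def __call__(self):
        video = np.zeros((3, 512, 512), dtype=np.float32)  # [C, H, W]
        t = (np.arange(self.chunk_samples) + self._t) / self.sample_rate
        self._t += self.chunk_samples
        audio = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)[np.newaxis, :]  # [C, S]
        return {
            "video": video,
            "audio": audio,                         # optional key
            "audio_sample_rate": self.sample_rate,  # optional key
        }

out = TonePipeline()()
```

A pipeline without audio would simply omit the last two keys, matching the "pipelines that don't produce audio are unchanged" guarantee above.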

Frontend

  • VideoOutput: Mute/unmute toggle (speaker icon). Starts muted to satisfy browser autoplay policy; user can unmute once the stream is playing.
  • useUnifiedWebRTC: Merges video and audio tracks into a single MediaStream. Adds a recvonly audio transceiver so the SDP offer includes an audio m-line for the backend to attach its track.

WebRTC Handshake

The browser adds addTransceiver("audio", { direction: "recvonly" }) so the offer includes an audio m-line. After setRemoteDescription, the backend finds the audio transceiver, attaches its AudioProcessingTrack, and sets direction to sendonly. The answer then indicates that the server will send audio.
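The server-side half of that handshake can be sketched as follows. The stub classes below only model the aiortc-style objects (`getTransceivers()`, `sender.replaceTrack(...)`, settable `direction`) so the direction-flip logic is runnable; the real code operates on an `RTCPeerConnection` after `setRemoteDescription`.

```python
class _Sender:
    """Stub for an RTCRtpSender."""
    def __init__(self):
        self.track = None
    def replaceTrack(self, track):
        self.track = track

class _Transceiver:
    """Stub for an RTCRtpTransceiver as offered by the browser."""
    def __init__(self, kind):
        self.kind = kind
        self.direction = "recvonly"  # browser offered recvonly
        self.sender = _Sender()

def attach_audio_track(transceivers, audio_track):
    """After setRemoteDescription: find the audio m-line and send on it."""
    for t in transceivers:
        if t.kind == "audio":
            t.sender.replaceTrack(audio_track)
            t.direction = "sendonly"  # answer advertises server -> client audio
            return t
    return None  # no audio m-line in the offer; nothing to attach

tr = attach_audio_track([_Transceiver("video"), _Transceiver("audio")], "audio-track")
```

Returning `None` when the offer has no audio m-line keeps older clients (which never call `addTransceiver("audio", ...)`) working unchanged.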

Architecture

Pipeline.__call__() → {"video": tensor, "audio": tensor, "audio_sample_rate": int}
    │
    ▼
PipelineProcessor.audio_output_queue
    │
    ▼
FrameProcessor._audio_drain_loop (resample to 48 kHz, buffer)
    │
    ├──► AudioProcessingTrack.recv() → WebRTC
    └──► NDIOutputSink.send_audio() → NDI
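The resample/mix/chunk step in the middle of this diagram can be sketched as below. This is a simplified stand-in, not the actual `FrameProcessor` code: resampling is reduced to linear interpolation, and the buffer class name is hypothetical.

```python
import numpy as np

TARGET_RATE = 48_000                 # WebRTC-standard audio rate
CHUNK = TARGET_RATE * 20 // 1000     # 960 samples per 20 ms frame

def to_mono_48k(audio, src_rate):
    """Mix [C, S] float audio to mono and resample to 48 kHz (linear interp)."""
    audio = np.atleast_2d(np.asarray(audio, dtype=np.float32))
    mono = audio.mean(axis=0)
    n_out = int(round(len(mono) * TARGET_RATE / src_rate))
    x_out = np.linspace(0.0, len(mono) - 1, n_out)
    return np.interp(x_out, np.arange(len(mono)), mono).astype(np.float32)

class AudioBuffer:
    """Accumulates samples; yields only complete 20 ms chunks."""
    def __init__(self):
        self._buf = np.empty(0, dtype=np.float32)
    def push(self, samples):
        self._buf = np.concatenate([self._buf, samples])
    def pop_chunks(self):
        chunks = []
        while len(self._buf) >= CHUNK:
            chunks.append(self._buf[:CHUNK])
            self._buf = self._buf[CHUNK:]
        return chunks

buf = AudioBuffer()
# 100 ms of stereo silence at 24 kHz -> 4800 samples at 48 kHz -> five 20 ms chunks
buf.push(to_mono_48k(np.zeros((2, 2400), dtype=np.float32), 24_000))
chunks = buf.pop_chunks()
```

Partial chunks stay in the buffer until enough samples arrive, which is what lets `get_audio()` hand the track uniformly sized 20 ms frames.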

Related

  • Architecture doc.
  • Pipelines that produce audio (e.g. LTX-2) are wired separately; this PR provides the infrastructure.


Signed-off-by: BuffMcBigHuge <marco@bymar.co>
@leszko (Collaborator) left a comment
Added some minor comments, I'll review it more in detail soon. But the general code structure looks fine.

The biggest thing is that I'm not sure the synchronization works correctly. @BuffMcBigHuge I tested this on your LTX-2 and found the video completely out of sync with the audio, so the experience is not good. The video pauses while the audio keeps playing, etc. See the recording I shared. This is something we need to address.

demo_audio.mp4

logger.error(f"Error sending NDI frame: {e}")
return False

def send_audio(
Collaborator:
Nice that you added the audio for NDI as well 🏅

Comment on lines +517 to +519
if not self.pipeline_processors:
time.sleep(0.01)
continue
Collaborator:
Can we maybe start the audio loop AFTER the pipeline processors are created? Then we wouldn't need this check here.

Collaborator (author):
I had struggled with this actually - I'll look into your suggestion.

Comment on lines +531 to +534
if isinstance(audio_tensor, torch.Tensor):
audio_np = audio_tensor.float().numpy()
else:
audio_np = np.asarray(audio_tensor, dtype=np.float32)
Collaborator:
Do we expect pipelines to produce audio in two formats? If not, why do we need this check?

Comment on lines +536 to +544
# Ensure shape is [C, S] (channels, samples)
if audio_np.ndim == 1:
audio_np = audio_np[np.newaxis, :] # mono -> [1, S]

# Mix down to mono for WebRTC (average channels)
if audio_np.shape[0] > 1:
audio_mono = audio_np.mean(axis=0)
else:
audio_mono = audio_np[0]
Collaborator:
Similar question: we have a lot of checks for the audio tensor shape; do we allow returning different formats?


return frame

def _audio_drain_loop(self):
Collaborator:
Nit: I wonder if we could/should move the audio-related code into a separate file, like audio.py. The main reason is that frame_processor.py is getting busy.

if audio_output is not None and audio_sample_rate is not None:
# Detach and move to CPU for downstream consumption
audio_output = audio_output.detach().cpu()
logger.info(
Collaborator:
I think it should be at the debug log level, because the logs get too noisy.

self._audio_buffer.append(audio_mono)
self._audio_buffer_samples += len(audio_mono)
logger.info(
f"[FRAME-PROCESSOR] Audio buffered: {len(audio_mono)} samples "
Collaborator:
I think this should be at the debug level, because the logs get too noisy.

@j0sh left a comment
One note on timing / synchronization. Do the audio and video pipelines run independently?

As far as I can tell, media is timestamped right before sending out to WebRTC, in which case there will likely be desync if the pipeline for one track is delayed compared to another. You might want to propagate a reference timestamp at the beginning of both pipelines earlier in the process, but I don't really know this codebase well enough to suggest where.

Re: sync, there might also be more subtle WebRTC usage issues but I am not sure yet; things look mostly fine from what I can tell. WebRTC playback sync is complex and some of the knobs in frameworks like Pion are a little unintuitive; I'm not too familiar with aiortc at the moment.
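One way to read j0sh's suggestion is to stamp frames with a shared capture time at the head of the pipeline, before the audio and video paths diverge, rather than timestamping at send time. The sketch below is hypothetical (names like `capture_time` are not from the PR); it only illustrates where the timestamp would be attached.

```python
import time

class MediaClockStub:
    """Minimal monotonic media clock, standing in for the PR's MediaClock."""
    def __init__(self):
        self._start = time.monotonic()
    def now(self):
        return time.monotonic() - self._start

def stamp(item, clock):
    """Attach capture_time before the pipelines diverge; both the video and
    audio paths carry it through, so PTS reflects capture time, not send time."""
    item["capture_time"] = clock.now()
    return item

clock = MediaClockStub()
frame = stamp({"video": "..."}, clock)
```

If one pipeline is slower than the other, both tracks would still derive PTS from the same capture instant, which is what lets the receiver re-align them during playback.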

Comment on lines 261 to 262
media_time = self.media_clock.get_media_time()
frame.pts = self.media_clock.media_time_to_audio_pts(media_time)
I think the corresponding call for video is missing?

nit: since media_time_to_audio_pts is called immediately after get_media_time, I think this API can be simplified to something like

frame.pts = self.media_clock.to_pts(AUDIO_CLOCK_RATE)

Since AUDIO_CLOCK_RATE is being manually set elsewhere anyway, I am not sure if there really needs to be separate API entry points for audio / video
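A sketch of that unified API, under the assumption that `MediaClock` is a simple monotonic clock (the class body here is illustrative, not the PR's implementation):

```python
import time

AUDIO_CLOCK_RATE = 48_000   # WebRTC audio clock
VIDEO_CLOCK_RATE = 90_000   # standard RTP video clock

class MediaClock:
    """One entry point: callers pass the clock rate instead of picking
    a media-specific helper."""
    def __init__(self):
        self._start = time.monotonic()
    def get_media_time(self):
        return time.monotonic() - self._start
    def to_pts(self, clock_rate):
        return int(self.get_media_time() * clock_rate)

clock = MediaClock()
audio_pts = clock.to_pts(AUDIO_CLOCK_RATE)
video_pts = clock.to_pts(VIDEO_CLOCK_RATE)
```

Both tracks then share one code path, and adding a new media kind is just a new rate constant.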
